SUPERIORITY OF FIT



1. INTRODUCTION 

Do discrepancies between observations and predictions indicate 
true population differences? A statistical test, which in the 
Neyman-Pearson formulation [4] can answer this question either 
yes (with the risk of a Type I error) or no (with the risk of a 
Type II error), can also, in the Fisher formulation [1, Chapter 2],
fail to answer the question (an insignificant result) or, answering
it, answer it only in the affirmative (a significant result).

Neyman [3] reviews the controversy between these two opposing 
formulations. Though the tendency over the years has been increas- 
ingly to adopt the Neyman-Pearson formulation in both textbooks 
and research reports, the practice in specific subject-matter areas 
has not always been consistent. In psychology, for example, while 
textbooks typically present the Neyman-Pearson formulation, research 
reports continue to reflect the influence of Fisher in such state- 
ments as "The result is significant (p < .05)" or "The result is 
not significant (p > .05)," where p indicates the probability
of obtaining the result (or a more extreme result) through
sampling error alone. The purpose here, however, is not to evaluate
either formulation, especially relative to the other, but rather 
to present a hybrid formulation applicable particularly to the 
evaluation of numerical predictions. 

2. A HYBRID FISHER - NEYMAN-PEARSON FORMULATION 

This formulation rests on a widespread belief among philo- 
sophers of science [e.g., 2, Chapter 4, particularly p. 78] that 
no numerical prediction is precisely accurate. Testing the
accuracy of a single numerical prediction thus makes no sense:
Use of a large enough sample will always lead to the rejection
of the prediction as inaccurate. To rule out tests of goodness 
of fit, however, is not to rule out tests of superiority of fit. 
Testing the relative accuracy of two different numerical predic- 
tions on the same set of observations does make sense. The null 
hypothesis ($H_0$) of such a test (the hypothesis to be nullified,
in Fisher's terminology) is simply that the two predictions are 
equally inaccurate. Equal inaccuracy implies that the population 
value is midway between the two predictions so that their mean is 
itself a precisely accurate prediction. The null hypothesis of 
equal inaccuracy must thus be false. 

If this hypothesis is false, however, then one of the two
predictions must be more accurate than the other. Deciding that
one prediction is more accurate than the other when the reverse
is true is thus an all-inclusive error having unconditional, or
total, probability

$$\alpha_T = \alpha_1 P_1 + \alpha_2 P_2, \tag{2.1}$$

where $\alpha_i$ ($i = 1, 2$) is the conditional probability of incorrectly
deciding that prediction $i$ is less accurate and $P_i$ ($i = 1, 2$) is
the (prior) probability that prediction $i$ is in fact more accurate.
Fairness to both predictions requires that $\alpha_1 = \alpha_2 = \alpha$
so that

$$\alpha_T = \alpha (P_1 + P_2). \tag{2.2}$$

If equal inaccuracy is impossible, $P_1 + P_2 = 1$, and thus $\alpha_T = \alpha$:
The total probability of error is equal to either one of the two
equal conditional probabilities of error.
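
As a quick numerical check of (2.1) and (2.2), the following Python sketch evaluates the total probability of error for several priors. The function name `total_error_probability` and the prior values are hypothetical, chosen only for illustration; the sketch assumes that the two priors sum to 1.

```python
def total_error_probability(alpha_1, alpha_2, p_1):
    """Total probability of error (2.1), assuming P_1 + P_2 = 1."""
    p_2 = 1.0 - p_1
    return alpha_1 * p_1 + alpha_2 * p_2


# With equal conditional error probabilities, the prior drops out (2.2).
for p_1 in (0.1, 0.5, 0.9):  # hypothetical priors
    print(p_1, round(total_error_probability(0.05, 0.05, p_1), 10))
```

Whatever prior is supplied, equal conditional probabilities yield the same total, which is why the formulation needs no knowledge of the priors.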

This formulation thus resembles Fisher's in its exclusion of
the acceptability of $H_0$ and Neyman-Pearson's in its inclusion of
the probability of error.

3. TESTING SUPERIORITY OF FIT 

Application in the form of a statistical test requires speci- 
fication of the null hypothesis and sequential data collection until 
the rejection of this hypothesis occurs. 

Since the null value ($\theta$) is midway between the two predicted
values ($\theta_1$ and $\theta_2$), the null hypothesis is
$H_0\colon \theta = (\theta_1 + \theta_2)/2$. The
equality sign in this hypothesis shows that the test is two-tailed.
Rejection of $H_0$ occurs when the test statistic falls in either tail
of the sampling distribution that the test statistic would have if
$H_0$ were true. The decision that follows, that one or the other
prediction is more accurate, depends on which tail this is. Either
decision has a probability of error equal to $\alpha$, the area under
each tail, which is also the total probability of error.

Sequential data collection is necessary to avoid the acceptance
of $H_0$, which, according to the belief that no prediction is
precisely accurate, is impossible. Sampling thus proceeds one sampling
unit at a time. Computation of the test statistic (or an equivalent
value) T follows the sampling of each sampling unit along with the
determination (if necessary) of appropriate critical values, $t_1$
and $t_2$ ($t_1 < t_2$). The decision depends on the relationship between
T and $t_1$ and $t_2$: If $T < t_1$, the decision is that prediction 1
is more accurate than prediction 2; if $T > t_2$, the decision is that
prediction 2 is more accurate than prediction 1. If $t_1 < T < t_2$,
however, the decision is to continue sampling.
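
The three-way decision rule just described can be sketched in Python as follows. The function name `decide` and its return convention (1, 2, or `None`) are hypothetical. Values exactly equal to a critical value are treated here as "continue sampling," which is an assumption, since the rule above leaves that boundary case unspecified.

```python
def decide(t_stat, t_1, t_2):
    """Return 1 or 2 for the prediction judged more accurate,
    or None when sampling should continue."""
    if not t_1 < t_2:
        raise ValueError("critical values must satisfy t_1 < t_2")
    if t_stat < t_1:
        return 1      # lower tail: prediction 1 is more accurate
    if t_stat > t_2:
        return 2      # upper tail: prediction 2 is more accurate
    return None       # between the critical values: keep sampling
```

Because the critical values may change with each new sampling unit, a caller would recompute `t_1` and `t_2` before every call.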



4. AN ILLUSTRATION OF THE METHOD 



On the first day of school, an instructor of a large class
administers a preliminary examination consisting of many items,
each scorable as correct or incorrect. Automatic scoring immediately
following the examination shows that the mean proportion of
items answered correctly is $\pi$. Knowing $\pi$, the instructor then
asks a student selected randomly from the class a question selected
randomly from the examination in a process that continues, if necessary,
for a parallel succession of students and questions until the
answer received is correct. Recording the number (X) of answers
received prior to the correct answer, the instructor repeats the
process for trial after trial in order to determine whether a geometric
distribution with parameter $\pi$ or a Poisson distribution with
parameter $(1 - \pi)/\pi$ more accurately predicts the variance of X.
Each of these distributions predicts the same mean: $\mu = (1 - \pi)/\pi$.

If the variance of X is $\sigma_1^2$ for the Poisson and $\sigma_2^2$ for the
geometric distribution, the null hypothesis is
$H_0\colon \sigma^2 = (\sigma_1^2 + \sigma_2^2)/2$, where

$$\sigma_1^2 = (1 - \pi)/\pi \tag{4.1}$$

and

$$\sigma_2^2 = (1 - \pi)/\pi^2. \tag{4.2}$$
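
To make the two predictions concrete, the following Python sketch evaluates the common mean, the two predicted variances, and the null value midway between them. The function name `predicted_variances` is hypothetical, and the proportion .2 is the value used for the results in Figure A.

```python
def predicted_variances(pi):
    """Return (mean, Poisson variance, geometric variance, null variance)
    for a proportion pi of correct answers."""
    mean = (1 - pi) / pi                 # common predicted mean
    var_poisson = mean                   # (4.1): Poisson variance equals its mean
    var_geometric = (1 - pi) / pi ** 2   # (4.2)
    var_null = (var_poisson + var_geometric) / 2  # midpoint tested under the null
    return mean, var_poisson, var_geometric, var_null


print([round(v, 6) for v in predicted_variances(0.2)])
```

For a proportion of .2, the predicted variances are 4 and 20, so the null value is 12.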



The test statistic is chi square divided by its degrees of freedom:

$$T = \sum_{i=1}^{N} (X_i - \mu)^2 / (N \sigma^2), \tag{4.3}$$

which ($\mu$ being known) has N degrees of freedom after the sampling
of N sampling units. Using $\mu$ and $\sigma^2$ as constants, an electronic
calculator determines T for each pair of X and N values,
and the instructor plots these values on a graph on which two lines
join the critical values $t_1$ and $t_2$ for successive values of N.

Figure A shows the results for $\pi = .2$ and $\alpha_T = .05$. After five
questions yielding the succession of X values 2, 5, 4, 6, and 4,
the instructor rejects $H_0$, deciding with a probability of error
equal to .05 that the Poisson distribution predicts the variance
more accurately than the geometric distribution.
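
The illustration can be retraced with the following Python sketch, offered as a minimal reconstruction under stated assumptions rather than as the original procedure. It assumes an area of .05 under each tail, a mean of 4, and a null variance of 12 (the midpoint of 4 and 20). Instead of looking up critical values, it compares the lower-tail chi-square probability with the tail area, an equivalent rule. The names `chi2_cdf` and `sequential_test` are hypothetical, and the chi-square probability comes from a standard series for the incomplete gamma function using only the standard library.

```python
import math


def chi2_cdf(x, df):
    """Lower-tail chi-square probability, computed from the series for
    the regularized incomplete gamma function P(df/2, x/2)."""
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= z / (a + n)
        total += term
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def sequential_test(xs, mu, var0, alpha):
    """Process observations one at a time; return (N, T, decision), where
    decision is 1 (lower tail), 2 (upper tail), or None (no rejection yet)."""
    ss, n, t = 0.0, 0, None
    for n, x in enumerate(xs, start=1):
        ss += (x - mu) ** 2
        chi2 = ss / var0          # chi square with n degrees of freedom
        t = chi2 / n              # test statistic (4.3)
        p = chi2_cdf(chi2, n)
        if p < alpha:
            return n, t, 1        # variance nearer the smaller prediction
        if p > 1 - alpha:
            return n, t, 2        # variance nearer the larger prediction
    return n, t, None


n, t, decision = sequential_test([2, 5, 4, 6, 4], mu=4.0, var0=12.0, alpha=0.05)
print(n, round(t, 3), decision)
```

Run on the five X values above, the sketch continues sampling through the fourth observation and stops at the fifth with T = .15, in the lower tail, which favors the smaller (Poisson) predicted variance.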

5. DISCUSSION OF THE ILLUSTRATION 

Since the succession of incorrect and correct answers consti- 
tutes a Bernoulli process, the distribution of X ought to be geo- 
metric, not Poisson. The result, however, is not one of the five 
errors that can occur in every one hundred repetitions of the test. 
The illustration is fictitious. The product of simulation, the five 
observations tend in fact to follow a Poisson distribution with 
parameter equal to 4. The histogram in Figure B describes this 
distribution. As a general rule, for samples as small as five, the 
probability of error approximates its nominal value to the extent 
that the observations follow a normal distribution. Since the histogram
in Figure B tends to be unimodal and symmetrical like a normal
curve, therefore, the total probability of error ought to be close
to its nominal value of .05 despite the small sample size.

FIGURE A. TEST STATISTIC (DOTS) AND CRITICAL VALUES ($t_1$ AND $t_2$) AS A FUNCTION OF TRIAL NUMBER (n)

FIGURE B. HISTOGRAM OF POISSON DISTRIBUTION WITH PARAMETER 4 (HORIZONTAL AXIS: NUMBER OF INCORRECT ANSWERS PRIOR TO CORRECT ANSWER; VERTICAL AXIS: PROBABILITY)

Such early rejection of $H_0$ in a sequential test is ordinarily
not so defensible. If the sampling distribution of X were closer
to the geometric than the Poisson, for example, the histogram in
Figure B might be skewed sufficiently to have a substantial effect
on the total probability of error, particularly for low N. Since
five observations were necessary to reject $H_0$ in favor of a
Poisson distribution even when the distribution of the observations
was in fact Poisson, however, a sample considerably larger than five
would likely be necessary to reject $H_0$ inappropriately in favor of
a Poisson distribution. The requirement of large samples for the
occurrence of error keeps the test honest. Regardless of the form
of the distribution of observations, the probability of error generally
tends more and more closely to approximate its nominal value as
samples increase in size.

Statistics other than the variance tested here are, of course, 
also possible targets of inference in tests of superiority of fit. 
The number of sampling units required, however, may depend on the 
statistic chosen for testing. This number generally ought to be 
smaller for predictions that are far apart than for predictions that 
are close together. The two distributions compared here thus allowed
no choice: The predictions of the mean were equal, and the predictions
of the variance were particularly far apart (4 versus 20).